Automated Segmentation of Pulmonary Lobes using Coordination-Guided Deep Neural Networks
The identification of pulmonary lobes is of great importance in disease diagnosis and treatment, and several lung diseases manifest as regional disorders at the lobar level; an accurate segmentation of the pulmonary lobes is therefore necessary. In this work, we propose an automated segmentation of pulmonary lobes from chest CT images using coordination-guided deep neural networks. We first employ an automated lung segmentation to extract the lung area from the CT image, then exploit a volumetric convolutional neural network (V-Net) to segment the pulmonary lobes. To reduce misclassification between lobes, we adopt coordination-guided convolutional layers (CoordConvs) that generate additional feature maps encoding the positional information of the pulmonary lobes. The proposed model is trained and evaluated on several publicly available datasets and achieves state-of-the-art accuracy, with a mean Dice coefficient of 0.947 ± 0.044.
Comment: ISBI 2019 (Oral)
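To make the coordination-guided idea concrete, here is a minimal PyTorch sketch of a CoordConv-style 3D layer that appends normalized coordinate channels before a standard convolution. Class name and shapes are illustrative assumptions, not the paper's implementation.

```python
# CoordConv-style layer: append normalized (z, y, x) coordinate channels
# so the convolution can condition on position within the lung volume.
import torch
import torch.nn as nn

class CoordConv3d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        # +3 input channels for the z, y, x coordinate maps
        self.conv = nn.Conv3d(in_channels + 3, out_channels, kernel_size, padding=padding)

    def forward(self, x):
        b, _, d, h, w = x.shape
        # Normalized coordinate grids in [-1, 1], one channel per axis
        zs = torch.linspace(-1, 1, d, device=x.device).view(1, 1, d, 1, 1).expand(b, 1, d, h, w)
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, 1, h, 1).expand(b, 1, d, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, 1, w).expand(b, 1, d, h, w)
        return self.conv(torch.cat([x, zs, ys, xs], dim=1))

# Example: one CoordConv block over a toy CT feature volume
feats = torch.randn(1, 8, 16, 32, 32)
out = CoordConv3d(8, 16)(feats)   # -> (1, 16, 16, 32, 32)
```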
Hypergraph Transformer for Skeleton-based Action Recognition
Skeleton-based action recognition aims to predict human actions from human joint coordinates and their skeletal interconnections. To model such off-grid data points and their co-occurrences, Transformer-based formulations are a natural choice. However, Transformers still lag behind state-of-the-art methods based on graph convolutional networks (GCNs). Transformers treat the input as permutation-invariant and homogeneous (only partially alleviated by positional encoding), which ignores an important characteristic of skeleton data: bone connectivity. Furthermore, each type of body joint has a clear physical meaning in human motion, i.e., motion retains an intrinsic relationship regardless of the joint coordinates, which Transformers do not exploit. In fact, certain recurring groups of body joints are often involved in specific actions, such as subconscious hand movements for keeping balance. Vanilla attention cannot describe such underlying relations, which are persistent and go beyond pair-wise. In this work, we aim to exploit these unique aspects of skeleton data to close the performance gap between Transformers and GCNs. Specifically, we propose a new self-attention (SA) extension, named Hypergraph Self-Attention (HyperSA), that incorporates inherently higher-order relations into the model. K-hop relative positional embeddings are also employed to take bone connectivity into account. We name the resulting model Hyperformer; it achieves accuracy and efficiency comparable to or better than state-of-the-art GCN architectures on the NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA datasets. On the largest dataset, NTU RGB+D 120, the significantly improved performance of our Hyperformer demonstrates the underestimated potential of Transformer models in this field.
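As an illustration of the k-hop relative positional idea only (not the full HyperSA module or the authors' code), the following PyTorch sketch biases attention scores with a learned embedding indexed by graph hop distance between joints:

```python
# Self-attention over skeleton joints with a learned bias per hop distance.
import torch
import torch.nn as nn
import torch.nn.functional as F

def hop_distance(adj, max_hops):
    # BFS-style hop distances between all joint pairs, clipped to max_hops.
    n = adj.shape[0]
    dist = torch.full((n, n), max_hops)
    reach = torch.eye(n, dtype=torch.bool)
    for k in range(max_hops):
        dist[reach & (dist == max_hops)] = k
        reach = reach | ((reach.float() @ adj.float()) > 0)
    return dist  # (n, n) integer hop counts

class HopBiasedSelfAttention(nn.Module):
    def __init__(self, dim, hops, max_hops=4):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.bias = nn.Embedding(max_hops + 1, 1)  # one scalar bias per hop count
        self.register_buffer("hops", hops)
        self.scale = dim ** -0.5

    def forward(self, x):                              # x: (batch, joints, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn + self.bias(self.hops).squeeze(-1)  # add hop-distance bias
        return F.softmax(attn, dim=-1) @ v

# Toy 5-joint chain skeleton
adj = torch.zeros(5, 5)
for i in range(4):
    adj[i, i + 1] = adj[i + 1, i] = 1
layer = HopBiasedSelfAttention(16, hop_distance(adj, 4))
out = layer(torch.randn(2, 5, 16))   # -> (2, 5, 16)
```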
DDColor: Towards Photo-Realistic Image Colorization via Dual Decoders
Image colorization is a challenging problem due to multi-modal uncertainty and high ill-posedness. Directly training a deep neural network usually leads to incorrect semantic colors and low color richness. While Transformer-based methods can deliver better results, they often rely on manually designed priors, suffer from poor generalization, and introduce color-bleeding artifacts. To address these issues, we propose DDColor, an end-to-end method with dual decoders for image colorization. Our approach includes a pixel decoder and a query-based color decoder. The former restores the spatial resolution of the image, while the latter utilizes rich visual features to refine color queries, thus avoiding hand-crafted priors. The two decoders work together to establish correlations between color and multi-scale semantic representations via cross-attention, significantly alleviating the color-bleeding effect. Additionally, a simple yet effective colorfulness loss is introduced to enhance color richness. Extensive experiments demonstrate that DDColor achieves superior performance to existing state-of-the-art works both quantitatively and qualitatively. The code and models are publicly available at https://github.com/piddnad/DDColor.
Comment: ICCV 2023; Code: https://github.com/piddnad/DDColor
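A hedged sketch of what a query-based color decoder can look like: learnable color queries refined against visual tokens via cross-attention. All names and dimensions here are illustrative assumptions, not DDColor's actual API.

```python
# Learnable color queries attend to image features via cross-attention.
import torch
import torch.nn as nn

class ColorQueryDecoder(nn.Module):
    def __init__(self, num_queries=100, dim=256, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, feats):                      # feats: (B, H*W, dim) visual tokens
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        q, _ = self.cross_attn(q, feats, feats)    # queries attend to image features
        return q + self.ffn(q)                     # refined color queries

dec = ColorQueryDecoder()
color_q = dec(torch.randn(2, 64 * 64, 256))       # -> (2, 100, 256)
```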
Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN
Masked image modeling (MIM), an emerging self-supervised pre-training method, has shown impressive success across numerous downstream vision tasks with Vision Transformers (ViTs). Its underlying idea is simple: a portion of the input image is randomly masked out and then reconstructed via a pretext task. However, the working principle behind MIM is not well explained, and previous studies have argued that MIM primarily works for the Transformer family and is incompatible with CNNs. In this paper, we first study interactions among patches to understand what knowledge is learned, and how it is acquired, via the MIM task. We observe that MIM essentially teaches the model to learn better middle-order interactions among patches and to extract more generalized features. Based on this observation, we propose an Architecture-Agnostic Masked Image Modeling framework (AMIM) that is compatible with both Transformers and CNNs in a unified way. Extensive experiments on popular benchmarks show that AMIM learns better representations without explicit design and endows the backbone model with a stronger capability to transfer to various downstream tasks, for both Transformers and CNNs.
Comment: Preprint under review (updated revision). The source code will be released at https://github.com/Westlake-AI/openmixup
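For intuition, a minimal architecture-agnostic MIM loop might look like the sketch below: mask random pixel patches, feed the masked image to any image-to-image backbone, and reconstruct only the masked region. This is our illustration of the general recipe, not the AMIM code.

```python
# Mask random patches in pixel space and reconstruct only masked pixels,
# so any backbone (CNN or Transformer) can be plugged in unchanged.
import torch
import torch.nn.functional as F

def random_patch_mask(b, h, w, patch=16, ratio=0.6, device="cpu"):
    gh, gw = h // patch, w // patch
    keep = torch.rand(b, gh, gw, device=device) > ratio          # True = visible
    mask = keep.repeat_interleave(patch, 1).repeat_interleave(patch, 2)
    return mask.unsqueeze(1).float()                             # (b, 1, h, w)

def mim_loss(backbone, images, patch=16, ratio=0.6):
    b, c, h, w = images.shape
    mask = random_patch_mask(b, h, w, patch, ratio, images.device)
    recon = backbone(images * mask)          # backbone only sees visible patches
    masked = 1.0 - mask
    # L1 reconstruction error, averaged over masked pixels and channels
    num = (F.l1_loss(recon, images, reduction="none") * masked).sum()
    return num / (masked.sum() * c).clamp(min=1)

# Any image-to-image module works as the backbone, e.g. a tiny CNN:
backbone = torch.nn.Sequential(
    torch.nn.Conv2d(3, 32, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(32, 3, 3, padding=1))
loss = mim_loss(backbone, torch.randn(2, 3, 64, 64))
```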
PGformer: Proxy-Bridged Game Transformer for Multi-Person Extremely Interactive Motion Prediction
Multi-person motion prediction is a challenging task, especially in real-world scenarios of densely interacting persons. Most previous works have been devoted to studying the case of weak interactions (e.g., hand-shaking), and typically forecast each human pose in isolation. In this paper, we focus on motion prediction for multiple persons engaged in extreme collaborations and attempt to explore the relationships between the highly interactive persons' motion trajectories. Specifically, a novel cross-query attention (XQA) module is proposed to bilaterally learn the cross-dependencies between the two pose sequences, tailored to this situation. Additionally, we introduce a proxy entity that bridges the involved persons; it cooperates with our XQA module and subtly controls the bidirectional information flows, acting as a motion intermediary. We then adapt these designs to a Transformer-based architecture and devise a simple yet effective end-to-end framework called the proxy-bridged game Transformer (PGformer) for multi-person interactive motion prediction. The effectiveness of our method has been evaluated on the challenging ExPI dataset, which involves highly interactive actions. We show that our PGformer consistently outperforms state-of-the-art methods in both short- and long-term predictions by a large margin. Our approach is also compatible with the weakly interacting CMU-Mocap and MuPoTS-3D datasets and achieves encouraging results there. Our code will be made publicly available upon acceptance.
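A minimal sketch of bilateral cross-attention in the spirit of XQA, where each person's pose sequence queries the other's; module names and the residual wiring are assumptions, not the released implementation (the proxy entity is omitted).

```python
# Two pose streams exchange information symmetrically via cross-attention.
import torch
import torch.nn as nn

class CrossQueryAttention(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn_ab = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_ba = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, a, b):           # a, b: (batch, time, dim) pose features
        a2, _ = self.attn_ab(a, b, b)  # person A queries person B
        b2, _ = self.attn_ba(b, a, a)  # person B queries person A
        return a + a2, b + b2          # residual updates for both streams

xqa = CrossQueryAttention()
a, b = torch.randn(2, 25, 128), torch.randn(2, 25, 128)
a_out, b_out = xqa(a, b)
```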
Discovering Physical Interaction Vulnerabilities in IoT Deployments
Internet of Things (IoT) applications drive the behavior of IoT deployments according to installed sensors and actuators. It has recently been shown that IoT deployments are vulnerable to physical interactions, caused by design flaws or malicious intent, that can have severe physical consequences. Yet, extant approaches to securing IoT do not translate app source code into its physical behavior to evaluate physical interactions. Thus, IoT consumers and markets cannot assess the safety and security risks these interactions present. In this paper, we introduce the IoTSeer security service for IoT deployments, which uncovers undesired states caused by physical interactions. IoTSeer operates in four phases: (1) translating each actuation command and sensor event in an app's source code into a hybrid I/O automaton that defines the app's physical behavior, (2) combining apps into a novel composite automaton that represents the joint physical behavior of the interacting apps, (3) applying grid-based testing and falsification to validate whether an IoT deployment conforms to desired physical interaction policies, and (4) identifying the root causes of policy violations and proposing patches that guide users in preventing them. We deploy IoTSeer in an actual house with 13 actuators, six sensors, and 37 apps, and demonstrate its effectiveness and performance.
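As a toy illustration of grid-based policy testing, drastically simplified from hybrid I/O automata to discrete state updates, consider two interacting "apps" and a physical-interaction policy checked over a grid of sensor readings. All app logic and thresholds below are invented for illustration.

```python
# Two apps act on a shared physical state; a policy predicate flags
# undesired joint states across a grid of discretized sensor values.
def heater_app(state):
    if state["temp"] < 20:
        state["heater"] = True          # actuation command

def window_app(state):
    if state["co2"] > 800:
        state["window_open"] = True     # actuation command

def violates_policy(state):
    # Desired policy: never heat with the window open
    return state["heater"] and state["window_open"]

# Grid-style exploration over discretized sensor readings
for temp in range(15, 26):
    for co2 in range(400, 1201, 100):
        state = {"temp": temp, "co2": co2, "heater": False, "window_open": False}
        heater_app(state)
        window_app(state)
        if violates_policy(state):
            print(f"violation at temp={temp}, co2={co2}")
```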
DAMO-StreamNet: Optimizing Streaming Perception in Autonomous Driving
Real-time perception, or streaming perception, is a crucial aspect of autonomous driving that has yet to be thoroughly explored in existing research. To address this gap, we present DAMO-StreamNet, an optimized framework that combines recent advances from the YOLO series with a comprehensive analysis of spatial and temporal perception mechanisms, delivering a cutting-edge solution. The key innovations of DAMO-StreamNet are: (1) a robust neck structure incorporating deformable convolution, enhancing the receptive field and feature-alignment capabilities; (2) a dual-branch structure that integrates short-path semantic features and long-path temporal features, improving motion-state prediction accuracy; (3) logits-level distillation for efficient optimization, aligning the logits of the teacher and student networks in semantic space; and (4) a real-time forecasting mechanism that updates support-frame features with the current frame, ensuring seamless streaming perception during inference. Our experiments demonstrate that DAMO-StreamNet surpasses existing state-of-the-art methods, achieving 37.8% sAP at normal resolution (600 × 960) and 43.3% sAP at large resolution (1200 × 1920), without using extra data. This work not only sets a new benchmark for real-time perception but also provides valuable insights for future research. Additionally, DAMO-StreamNet can be applied to various autonomous systems, such as drones and robots, paving the way for real-time perception in those domains. The code is available at https://github.com/zhiqic/DAMO-StreamNet.
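Of the four innovations, logits-level distillation (3) is the easiest to sketch: a generic temperature-scaled KL loss that aligns student logits with teacher logits. This is the standard knowledge-distillation recipe; DAMO-StreamNet's exact formulation may differ.

```python
# Temperature-scaled KL divergence between teacher and student logits.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    p_t = F.softmax(teacher_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    # KL(teacher || student), scaled by T^2 to keep gradient magnitudes stable
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

s = torch.randn(8, 80, requires_grad=True)   # e.g. per-detection class logits
t = torch.randn(8, 80)                       # frozen teacher predictions
loss = distill_loss(s, t)
loss.backward()
```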
Parallel Aligned Treebank Corpora at LDC: Methodology, Annotation and Integration
Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora (AEPC 2010). Editors: Lars Ahrenberg, Jörg Tiedemann and Martin Volk. NEALT Proceedings Series, Vol. 10 (2010), pp. 14-23. © 2010 the editors and contributors. Published by the Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt. Electronically published at Tartu University Library (Estonia), http://hdl.handle.net/10062/15893.
KeyPosS: Plug-and-Play Facial Landmark Detection through GPS-Inspired True-Range Multilateration
In the realm of facial analysis, accurate landmark detection is crucial for applications ranging from face recognition and expression analysis to animation. Conventional heatmap- or coordinate-regression-based techniques, however, often face challenges of computational burden and quantization error. To address these issues, we present the KeyPoint Positioning System (KeyPosS), a facial landmark detection framework that stands apart from existing methods. The framework utilizes a fully convolutional network to predict a distance map giving the distance between a Point of Interest (POI) and multiple anchor points. These anchor points are then harnessed to pinpoint the POI's position via the true-range multilateration algorithm. Notably, the plug-and-play nature of KeyPosS enables seamless integration into any decoding stage, ensuring a versatile and adaptable solution. We conducted a thorough evaluation of KeyPosS by benchmarking it against state-of-the-art models on four datasets. The results show that KeyPosS substantially outperforms leading methods in low-resolution settings while incurring only minimal time overhead. The code is available at https://github.com/zhiqic/KeyPosS.
Comment: Accepted to ACM Multimedia 2023; 10 pages, 7 figures, 6 tables; the code is at https://github.com/zhiqic/KeyPosS
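For intuition, true-range multilateration can be sketched in a few lines of NumPy: linearize the circle equations around one anchor and solve the resulting least-squares system. This is the generic textbook version, not the KeyPosS implementation.

```python
# Recover a 2D point from its distances to known anchors.
# From |x - p_i|^2 = d_i^2, subtracting the i = 0 equation gives the
# linear system  2 (p_i - p_0) . x = (|p_i|^2 - |p_0|^2) - (d_i^2 - d_0^2).
import numpy as np

def multilaterate(anchors, dists):
    # anchors: (n, 2) anchor coordinates; dists: (n,) predicted distances
    p0, d0 = anchors[0], dists[0]
    A = 2.0 * (anchors[1:] - p0)
    b = (np.sum(anchors[1:] ** 2, axis=1) - np.sum(p0 ** 2)) - (dists[1:] ** 2 - d0 ** 2)
    xy, *_ = np.linalg.lstsq(A, b, rcond=None)
    return xy

anchors = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
target = np.array([3.0, 4.0])
dists = np.linalg.norm(anchors - target, axis=1)
print(multilaterate(anchors, dists))   # ~ [3. 4.]
```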